SelectivePCA's weight=True capability

Some algorithms intrinsically treat every feature as equally important. For many of them, e.g., clustering algorithms, that assumption is a fallacy and can produce inappropriate results. The following notebook demonstrates skutil's weighting capability via SelectivePCA.
In [1]:
from __future__ import print_function
import numpy as np
import pandas as pd
import sklearn
from sklearn.datasets import load_iris
sklearn.__version__
Out[1]:
In [2]:
iris = load_iris()
X, y = iris.data, iris.target # this is unsupervised; we aren't going to split
In [5]:
from sklearn.metrics import accuracy_score
from skutil.decomposition import SelectivePCA
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
# define our default pipe
pca = SelectivePCA(n_components=0.99)
pipe = Pipeline([
    ('pca', pca),
    ('model', KMeans(3))
])
# fit the pipe
pipe.fit(X, y)
# predict and score
print('Train accuracy: %.5f' % accuracy_score(y, pipe.predict(X)))
This is a decent accuracy, but not a stellar one. Surely we can improve on it, right? (Keep in mind that KMeans cluster ids are arbitrary, so this score is only meaningful when they happen to line up with the target encoding.) Part of the problem is that clustering relies on distance metrics, which treat all features equally. Since PCA orders its components by the amount of variance each one explains, we can weight the components accordingly. Thus, the components that explain the most variance will be up-weighted, and those that explain the least will be down-weighted.
Here is the explained_variance_ratio_ vector:
In [6]:
pca.pca_.explained_variance_ratio_
Out[6]:
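And here's what our weighting vector will ultimately look like. The ratios are centered on their median and then shifted by one, so the median component ends up with a weight of exactly 1: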
In [7]:
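# copy first: the in-place operations below would otherwise
# overwrite the fitted explained_variance_ratio_ attribute
weights = pca.pca_.explained_variance_ratio_.copy()
weights -= np.median(weights)
weights += 1
weights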
Out[7]:
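To make the effect of the weighting vector concrete, here is a minimal sketch that applies it by hand. It uses plain scikit-learn PCA as a stand-in for SelectivePCA, and it assumes that weight=True simply multiplies each projected component by its weight; that assumption is not verified against skutil's source here. The names components and weighted are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Plain scikit-learn PCA stands in for SelectivePCA in this sketch
pca = PCA(n_components=0.99, svd_solver='full').fit(X)
components = pca.transform(X)

# Build the weighting vector from a copy of the fitted ratios
weights = pca.explained_variance_ratio_.copy()
weights -= np.median(weights)
weights += 1

# Assumed weighting step: scale each component column by its weight
weighted = components * weights

print('weights:            ', weights)
print('std before weighting:', components.std(axis=0))
print('std after weighting: ', weighted.std(axis=0))
```

Under that assumption, the leading component is stretched relative to the others, so it contributes more to the Euclidean distances that KMeans computes, while the median component is left unchanged.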
In [10]:
# define our weighted pipe
pca = SelectivePCA(n_components=0.99, weight=True)
pipe = Pipeline([
    ('pca', pca),
    ('model', KMeans(3))
])
# fit the pipe
pipe.fit(X, y)
# predict and score
print('Train accuracy (with weighting): %.5f' % accuracy_score(y, pipe.predict(X)))
Note that this is not limited to KMeans, or even to clustering tasks. Any algorithm that does not intrinsically perform some kind of regularization or feature selection may fall into this trap, and SelectivePCA's weighting can help!
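One caveat about the scoring used in this notebook: KMeans assigns cluster ids arbitrarily, so accuracy_score against y depends on the ids happening to match the class encoding. The sketch below shows two ways to score that do not depend on the numbering: the adjusted Rand index, and accuracy after a one-to-one cluster-to-class alignment found with the Hungarian algorithm. It runs KMeans directly on the raw iris features rather than through the pipeline, and random_state and n_init are fixed only to make the sketch repeatable; neither setting comes from the notebook above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (accuracy_score, adjusted_rand_score,
                             confusion_matrix)

X, y = load_iris(return_X_y=True)

# random_state and n_init are assumptions made for repeatability
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Agreement measure that ignores how the clusters are numbered
ari = adjusted_rand_score(y, labels)

# Rows are true classes, columns are cluster ids
table = confusion_matrix(y, labels)

# One-to-one assignment that maximizes the matched counts
class_idx, cluster_idx = linear_sum_assignment(-table)
mapping = np.empty(table.shape[0], dtype=int)
mapping[cluster_idx] = class_idx
aligned = mapping[labels]

print('Raw accuracy:     %.5f' % accuracy_score(y, labels))
print('Aligned accuracy: %.5f' % accuracy_score(y, aligned))
print('Adjusted Rand:    %.5f' % ari)
```

The aligned accuracy can never be lower than the raw accuracy, because the raw score corresponds to one particular cluster-to-class mapping (the identity) among those the assignment step considers.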